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ABSTRACT 


For two independent groups, let Mj{x) be some conditional measure of location for the 
jth group associated with some random variable Y, given that some covariate X = x. When 
Mj{x) is a robust measure of location, or even some conditional quantile of Y, given X, 
methods have been proposed and studied that are aimed at testing Hq: Mi[x) = M 2 {x) that 
deal with curvature in a flexible manner. In addition, methods have been studied where 
the goal is to control the probability of one or more Type I errors when testing Hq for 
each X G {xi,... ,Xp}. This paper suggests a method for testing the global hypothesis Hq\ 
Mi{x) = M 2 {x) for \/x G {xi,... ,Xp} when using a robust or quantile location estimator. 
An obvious advantage of testing p hypotheses, rather than the global hypothesis, is that it 
can provide information about where regression lines differ and by how much. But the paper 
summarizes three general reasons to suspect that testing the global hypothesis can have more 
power. Data from the Well Elderly 2 study illustrate that testing the global hypothesis can 
make a practical difference. 

Keywords: ANCOVA, trimmed mean, non-parametric regression, Harrell-Davis estima¬ 
tor, bootstrap methods, comparing quantiles. Well Elderly 2 study 


1 Introduction 

For two independent groups, consider the situation where for the jth group (j = 1, 2) Yj is 
some outcome variable of interest and Xj is some covariate. The classic ANCOVA method 
assumes that 

Yj = Poj + l^iXj + e, (1) 

where /3oj and /3i are unknown parameters and e is a random variable having a normal 
distribution with mean zero and unknown variance So the regression lines are assumed 
to be parallel and the goal is to compare the intercepts based in part on a least squares 
estimate of the regression lines. It is well known, however, that there are serious concerns 
with this approach. First, there is a vast literature establishing that methods based on 
means, including least squares regression, are not robust (e.g., Staudte and Sheather, 1990; 
Marrona et ah, 2006; Heritier et ah, 2007; Hampel et ah, 1986; Huber and Ronchetti, 2009; 
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Wilcox, 2012a, 2012b). A general concern is that violations of underlying assumptions can 
result in relatively poor power and poor control over the Type I error probability. Moreover, 
even a single outlier can yield a poor £t to the bulk of the points when using least squares 
regression. 

As is evident, one way of dealing with non-normality is to use some rank-based tech¬ 
nique. But rank-based ANCOVA methods are aimed a testing the hypothesis of identical 
distributions (e.g., Lawson, 1983). So when this method rejects, it is reasonable to conclude 
that the distributions differ in some manner, but the details regarding how they differ, and 
by how much, are unclear. One way of gaining some understanding of how the groups differ, 
but certainly not the only way, is to compare the groups using some measure of location. 
Here the goal is to make inferences about some robust (conditional) measure of location 
associated with V. 


Yet another fundamental concern with ([^ is that the true regression lines are assumed to 
be straight. Certainly, in some situations, this is a reasonable approximation. When there 
is curvature, simply meaning that the regression line is not straight, using some obvious 
parametric regression model might suffice. (For example, include a quadratic term.) But 
this approach can be inadequate, which has led to a substantial collection of nonparametric 
methods, often called smoothers, for dealing with curvature in a more flexible manner (e.g., 
Hardle, 1990; Efromovich, 1999; Eubank , 1999; Fox, 2001; Gyorfi, et ah, 2002). 

Here, the model given by ([^ is replaced with the less restrictive model 

i-j = /j(V) + (2) 


where fj{Xj) is some unknown function that reflects some conditional measure of location 
associated with Y given that the covariate value is Xj. The random variable ej has some 
unknown distribution with variance ct|. So unlike the classic approach where it is assumed 
that 

f,{X^) = (3oj + (3,jX^, 


no parametric model for fj{Xj) is specihed and af = is not assumed. Let Mj{x) be some 
(conditional) measure of location associated with Yj given that Xj = x. Here, curvature is 
addressed using a running interval smoother. Roughly, like all smoothers, the basic strategy 
is to focus on the Xj values close to x and use the corresponding Yj values to estimate Mj{x). 
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An appeal of the running interval smoother is that it is easily applied when using any robust 
measure of location. The details are given in the next section of this paper. 

The goal here is to test the global hypothesis 


Hq : Mi{x) = M2 {x), Vx e {xi,.. .,Xp}, (3) 

where Xi,... ,Xp are p values of the covariate chosen empirically in a manner aimed at captur¬ 
ing any curvature that might exist. Roughly, these p values are chosen using a component of 
the so-called running interval smoother, which is described in section 2. Put in more substan¬ 
tive terms, the goal is to determine whether two groups differ (e.g., depressive symptoms 
among males and females) taking into account the possibility that the extent they differ 
might depend in a non-trivial manner on some covariate (such as the cortisol awakening 
response). 

In the context of ANCOVA, use of the running interval smoother is not new. In particular 
Wilcox (1997) proposed and studied a method that tests Hq- Afi(xfc) = M 2 (xk) for each fc, 
k = 1,..., p. So p hypotheses are tested rather than the global hypothesis corresponding to 
(|^. The method is based in part on Yuen’s (1974) method for comparing trimmed means 
with the familywise error rate (the probability of one or more Type I errors) controlled using 
a strategy that is similar to Dunnett’s (1980) T3 technique. More recently, a bootstrap 
variation was proposed and studied by Wilcox (2009). Now the familywise error rate can be 
controlled using some improvement on the Bonferroni method (e.g., Rom, 1990; Hochberg, 
1988). The bootstrap method can, in principle, be used with any robust measure of location. 

However, a practical concern with testing p individual hypotheses, rather than a global 
hypothesis, is that power might be relatively low for three general reasons. First, each indi¬ 
vidual hypothesis uses only a subset of the available data. In contrast, the global hypothesis 
used here is based on all of the data that are used to test the individual hypotheses. That 
is, a larger sample size is used suggesting that it might reject in situations where the none of 
individual tests is signihcant. Second, if for example the familywise error rate is set at .05, 
then Wilcox’s method uses a Type I error probability less than .05 for the individual tests, 
which again can reduce power. The third reason has to do with using a conhdence region for 
two or more parameters as opposed to conhdence intervals for each individual parameter of 
interest. It is known that in various situations, conhdence regions can result in a signihcant 
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difference even when there are non-signihcant results for the individual parameters. (For an 
illustration, see for example Wilcox, 2012b, p. 690.) The method proposed here for testing 
(|^ deals with this issue in a manner that is made clear in section 3. Data from the Well 
Elderly 2 study (Jackson et ah, 2009; Clark et ah, 2011) are used to illustrate that the new 
method can make a practical difference. 

Another goal in this paper is to include simulation results on comparing (conditional) 
quartiles. Comparing medians is an obvious way of proceeding. But in some situations, 
differences in the tails of two distributions can be more important and informative than 
comparisons based on a measure of location that is centrally located (e.g., Doksum & Sievers, 
1976; Lombard, 2005). This proved to be the case in the Well Elderly 2 study for reasons 
explained in section 4. 

Note that rather than testing 0 . a seemingly natural goal is to test the hypothesis that 
Mi(x) = M 2 {x) for all possible values of x, not just those values in the set {xi,... ,Xp}. 
Numerous papers contain results on methods for accomplishing this goal when Mj{x) is 
taken to be the conditional mean of Y given that X = x. (For a list of references, see 
Wilcox, 2012a, p. 610.) But the mean is not robust and evidently little or nothing is known 
about how best to proceed when using some robust measure of location. Wilcox (2012a, 
section 11.11.5) describes a robust method based on a running interval smoother, but the 
choice for the span (the value of ij described in the next section) is dictated by the sample 
size given the goal of controlling the Type 1 error probability. That is, a suboptimal ht to the 
data might be needed. The method used here avoids this problem. Here, some consideration 
was given to an approach where a robust smoother is applied to each group and predicted 
Y values are computed for all of the observed x values. If the null hypothesis is true, the 
regression line for the differences Mi(x) — M 2 (x), versus x, should have a zero slope and 
intercept. Several bootstrap methods were considered based on this approach, but control 
over the Type 1 error probability was very poor, so no details are provided. 
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2 Description of the Proposed Method 


Following Wilcox (1997), the general strategy is to approximate the regression lines with 
a running interval smoother and then use the components of the smoother to test some 
relevant hypothesis. A portion of the method requires choosing a location estimator. As will 
be made evident, in principle any robust location estimator could be used, but here attention 
is focused on only two estimators: a 20% trimmed mean and the quantile estimator derived 
by Harrell and Davis (1982). 


Let Zi,..., Zn he any n observations. The y-trimmed mean is 


1 

n-2g 


n-g 

T. 

i=g+l 


where Z(i) < ■ ■ ■ < are the Z values written in ascending order and g = [ynj is the 
greatest integer less than or equal to yn, 0 < y < .5. The 20% trimmed mean corresponds 
to y = .2. One advantage of the 20% trimmed mean is that its efficiency compares well 
to the sample mean under normality (e.g., Rosenberger & Gasko, 1983). But as we move 
toward a more heavy-tailed distribution, the standard error of the 20% trimmed mean can 
be substantially smaller than the standard error of the mean, which can translate into sub¬ 
stantially higher power when outliers tend to occur. Another appeal of the 20% trimmed 
mean over the mean, when testing hypotheses, is that both theory and simulations indicate 
that the 20% trimmed is better at handling skewed distributions in terms of controlling the 
Type I error probability. This is not to suggest that the 20% trimmed mean dominates all 
other robust estimators that might be used. Clearly this is not the case. The only point is 
that it is a reasonable measure of location to consider for the situation at hand. 


The Harrell and Davis (1982) estimate of the gth quantile uses a weighted average of all 
the order statistics. Let U he a random variable having a beta distribution with parameters 
a = {n + l)q and b = (n -|- 1)(1 — g) and let 



The estimate of the gth quantile, based on Zi, 


<U < 



• • • 1 Zni is 


n 

Oq = '^ViZ(i). 
i=\ 


(4) 
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In terms of its standard error, Sfakianakis and Verginis (2006) show that in some sitna- 
tions the Harrell-Davis estimator competes well with alternative estimators that again nse 
a weighted average of all the order statistics, bnt there are exceptions. (Sfakianakis and 
Verginis derived alternative estimators that have advantages over the Harrell-Davis in some 
situations. But we found that when sampling from heavy-tailed distributions, the standard 
error of their estimators can be substantially larger than the standard error of 6q.) Compar¬ 
isons with other quantile estimators are reported by Parrish (1990), Sheather and Marron 
(1990), as well as Dielman, Lowry and Pfaffenberger (1994). The only certainty is that no 
single estimator dominates in terms of efficiency. For example, the Harrell-Davis estimator 
has a smaller standard error than the usual sample median when sampling from a normal 
distribution or a distribution that has relatively light tails, but for sufficiently heavy-tailed 
distributions, the reverse is true (Wilcox, 2012a, p. 87). 

To describe the details of the method for testing ([^, let (Xjj, Yij) {i = 1,... ,nj; j = 1, 

2) be a random sample of size nj from the jth group. For a chosen value for x, suppose the 
goal is to estimate Mj{x). The strategy is simple. Roughly, for each j, compute a measure 
of location based on the Y^j values for which the corresponding Xij values are close to x. 
More formally, for hxed j, compute a measure of location based on the Yij values such that 
i is an element of the set 

Pj{x) = {i : \Xij — x\ < x MADNj}, 

where is a constant chosen by the investigator and often called the span, MADNj=MADj/.6745, 
MADj (the median absolute deviation) is the median of \Xij — mj|,... — rnj \ and mj is 

the usual sample median based on ... ,Xnjj. Under normality, MADNj=MADj/.6745 
estimates the population standard deviation, in which case Xij is close to x if it is within 
£j standard deviations from x. Generally, the choice £j = .8 or £j = 1 gives good results, in 
terms of capturing any curvature, but of course exceptions are encountered. 

Let Nj{x) be the cardinality of the set Pj{x) and suppose that Mj{x) is estimated with 
some measure of location based on the Y^j values for which i G Pj{x). The two regression 
lines are dehned to be comparable at x if simultaneously Ni{x) > 12 and N 2 {x) > 12. The 
idea is that if the sample sizes used to estimate Mi{x) and M 2 {x) are sufficiently large, then 
a reasonably accurate conhdence interval for Mi(x) — M 2 {x) can be computed provided a 
reasonably level robust technique is used. For example, Yuen’s (1974) method might be used 
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with a 20% trimmed mean. (It is known that nnder fairly general conditions methods for 
comparing means are not level robnst with relatively small sample sizes. See Wilcox, 2012b, 
for details.) 

For notational convenience, let 6jk be some location estimator based on the Yij valnes 
for which i G Pj{xk). Let Sk = 9ik — 02k and let 5k denote the popnlation analog of 5k 
(A; = 1,... ,p). Then ([^ corresponds to 

/fo:hi = h2 = ---5p = 0. (5) 

The basic strategy for testing (|^ is to generate bootstrap samples from each gronp, compnte 
5k based on these bootstrap samples, repeat this B times, and then measnre how deeply the 
nnll vector 0 is nested in the bootstrap clond of points via Mahalanobis distance. Based on 
these distances, resnlts in Lin and Singh (1997) indicate how to compnte a p-valne. 

To elaborate, let Y*j) be a bootstrap sample from the jth gronp, which is obtained 
by resampling with replacement Uj pairs of points from (Xij,Yij) {i = 1,... ,nj; j = 1, 2). 
Let 5l be the estimate of 5k based on the bootstrap samples from the two gronps. Repeat 
this process B times yielding = (hjj'^,..., 5*^), b = 1,... ,B. Let S be the covariance 
matrix based on the B vectors A^,..., A^. Note that the center of the bootstrap clond 
being estimated by these B bootstrap samples is known. It is A = (A, • • •, (^p), the estimate 
of A = (5i,..., hp) based on the (A^-, Y^j) valnes. Let 

di = {Ai - A)s-\Ai - Ay, 

where for 6 = 0, Aq is taken to be the nnll vector 0. Then a (generalized) p-valne is 

lf;/(4<4), (6) 

b=i 

where the indicator fnnction I{dl < df) = 1 if dg < dl; otherwise /(dg < df) = 0. 

There remains the problem of choosing the Xk valnes. They might be chosen based 
on snbstantive gronnds, bnt of conrse stndying this strategy via simnlations is difficnlt at 
best. Here, we follow Wilcox (1997) and choose p = 5 points in a manner snggested by 
rnnning interval smoother in terms of captnring any cnrvatnre in a flexible manner. For 
notational convenience, assnme that for fixed j, the Xij valnes are in ascending order. That 


is, Xij < • • • < Xn^j- Suppose zi is taken to be the smallest Xu value for which the 
regression lines are comparable. That is, search the hrst group for the smallest Xu such that 
Ni{Xii) > 12. If N 2 {Xii) > 12, in which case the two regression lines are comparable at 
Xii, set xi = Xii. If N 2 {xii) < 12, consider the next largest xn value and continue until it 
is simultaneously true that A^i(Xii) > 12 and N 2 {Xii) > 12. Let ii be the smallest integer 
such that > 12 and N 2 {xi^i) > 12. Similarly, let x^ be the largest Xu value for 

which the regression lines are comparable. That is, x^ is the largest Xu value such that 
N\{xii) > 12 and N 2 {xii) > 12. Let is be the corresponding value of i. Let = (ii + i5)/2, 
^2 = (ii + *3)/2, and = (is + 15 )/2. Round i 2 , is, and i 4 down to the nearest integer and 
set X 2 = xs = Xigi, and x^ = 

When the covariate values are chosen in the manner just described, and p = 5 separate 
tests are performed based on some measure of location, this will be called method W hence¬ 
forth. Computing a p-value using (|^, with the goal of performing a global test, will be 
called method G. Unless stated otherwise, both methods G and W will be based on a 20% 
trimmed mean. 

Note that in essence, we have a 2-by-p ANOVA design. But for the p levels of the second 
factor, the groups are not necessarily independent. The reason is that for any two covariate 
values, say Xk and Xm, the intersection of the sets Pj{xk) and Pj{xm) is not necessarily equal 
to the empty set. Here, the strategy for dealing with this feature is to model it via a bootstrap 
method. Another approach would be divide the data into p independent groups. But there 
is uncertainty about how this might be done so as to effectively capture any curvature. The 
approach used here mimics a basic component used by a wide range of smoothers designed 
to deal with curvature in a flexible manner. 

Of course, the obvious decision rule, when using method G, is to reject the null hypothesis 
if the p-value is less than or equal to the nominal level. When testing at the a = .05 level, 
preliminary simulations indicated that this approach performs well, in term of controlling 
the Type I error probability, when p = 3 and the Xk values are taken to be the quartiles 
corresponding to the Xu values. But when p = 5 and the Xk values are chosen as just 
described, the actual level exceeded .075 when testing at the a = .05 level with ni = n 2 = 30. 
This problem persisted with ni = ^2 = 50. However, for the range of distributions considered 
(described in section 3), the actual level was found to be relatively stable. This suggests using 
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a strategy similar to Gosset’s (Student’s) approach to comparing means: Assume normality, 
determine an appropriate critical value using a reasonable test statistic, and continue using 
this critical value when dealing with non-normal distributions. 

Given ni and n 2 , this strategy is implemented by hrst by generating, for each j, tij pairs 
of observations from a bivariate normal distribution having a correlation p = 0. Based on 
this generated data, determine p = 5 values of the covariate in the manner just described 
and then compute the p-value given by ([^. Denote this p-value by p. Repeat this process 
A times yielding pi,... ,Pa- Then an a level critical p-value, say Pc, is taken to be the a 
quantile of the pi,... ,pA values, which here is estimated via the Harrell-Davis estimator. 
(With A=1000 and when a trimmed mean is used, this can be done in 14.8 seconds using 
an R function, described in the hnal section of this paper, running on the hrst author’s 
MacBook Pro.) That is, letting po denote the p-value based on the observed data, reject (|^ 
if Po < Pc- 

Note that once Pc has been determined, a 1 — a conhdence region for the vector A = 
(5i,... 5p) can be computed. A conhdence region consists of the convex hull containing the 
{l—Pc)B Ah vectors that have the smallest dl values. As previously indicated, this conhdence 
region provides a perspective on why the global test considered here can have more power 
than method W. Situations are encountered where the null vector is not contained in the 
conhdence region, yet the conhdence intervals for each of the p diherences contain zero. 


3 Simulation Results 

Simulations were used to study the small-sample properties of the proposed method with 
n\ = n 2 = 30. Smaller sample sizes are dubious because this makes it particularly difficult 
to ehectively deal with curvature. Also, hnding hve covariate values where the groups are 
comparable can be problematic. That is, Nj{x) might be so small as to make comparisons 
meaningless. A few results are reported with ni = n 2 = 100 and 200 as well. 

Estimated Type I error probabilities, a, were based on 4000 replications. The estimated 
critical p-value was based on A = 1000 and B = 500 bootstrap samples. Four types of 
distributions were used: normal, symmetric and heavy-tailed, asymmetric and light-tailed. 
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Table 1: Some properties of the g-and-h distribution. 


g 

h 

Ki 

«2 

0.0 

0.0 

0.00 

3.0 

0.0 

0.2 

0.00 

21.46 

0.2 

0.0 

0.61 

3.68 

0.2 

0.2 

2.81 

155.98 


and asymmetric and heavy-tailed. More precisely, the marginal distributions were taken 
to be one of four g-and-h distributions (Hoaglin, 1985) that contain the standard normal 
distribution as a special case. (The R function ghdist, in Wilcox, 2012a, was used to generate 
observations from a g-and-h distribution.) If Z has a standard normal distribution, then by 
dehnition 


V 


expy)-i exp(hZV2), if^>0 
Zexp(hZ^/2), if g = 0 


has a g-and-h distribution where g and h are parameters that determine the hrst four mo¬ 
ments. That is, a g-and-h distribution is a transformation of the standard normal random 
variable that can be used to generate data having a range of skewness and kurtosis values. 
The four distributions used here were the standard normal [g = h = 0.0), a symmetric 
heavy-tailed distribution {h = 0.2, g = 0.0), an asymmetric distribution with relatively light 
tails {h = 0.0, g = 0.2), and an asymmetric distribution with heavy tails {g = h = 0.2). Ta¬ 
ble 1 shows the skewness (ki) and kurtosis {k, 2 ) for each distribution. Additional properties 
of the g-and-h distribution are summarized by Hoaglin (1985). 


The g-and-h distributions with h = .2 were chosen in an attempt to span the range of 
distributions that might be enconntered in practice. The idea is that if method G performs 
well for what some might regard as an unrealistic departnre from normality, this provides 
some reassnrance that it will perform reasonably when dealing with data from an actual 
study. 


Three types of associations were considered. The hrst two deal with situations where 
Yij = (3Xij -|- e. The two choices for the slope were /3 = 0 and 1. The third type of 
association was Yij = Xfj + e. These three situations are labeled SI, S2 and S3, respectively. 
The estimated Type I errors were very similar for SI and S2, so for brevity the results for S2 


11 



Table 2: Estimated Type I error probabilities when testing at the a = .05 level, ni = n 2 = 30 


9 

h 

Estimator 

SI 

S3 

0.0 

0.0 

7 = 

.2 

.048 

.048 

0.0 

0.0 

9 = 

.50 

.038 

.044 

0.0 

0.0 

9 = 

.75 

.049 

.048 

0.0 

0.2 

7 = 

.2 

.022 

.026 

0.0 

0.2 

9 = 

.50 

.023 

.028 

0.0 

0.2 

9 = 

.75 

.029 

.028 

0.2 

0.0 

7 = 

.2 

.040 

.047 

0.2 

0.0 

9 = 

.25 

.053 

.056 

0.2 

0.0 

9 = 

.50 

.036 

.044 

0.2 

0.0 

9 = 

.75 

.046 

.045 

0.2 

0.2 

7 = 

.2 

.020 

.024 

0.2 

0.2 

9 = 

.25 

.040 

.040 

0.2 

0.2 

9 = 

.50 

.022 

.028 

0.2 

0.2 

9 = 

.75 

.026 

.025 


are not reported. The Xij values were generated from a standard normal distribution and e 
was generated from one of the four g-and-h distributions previously indicated. 

The simulation results are reported in Table 2 . As can be seen, when testing at the 
.05 level, the actual level was estimated to be less than or equal to .056 among all of the 
situations considered. Although the seriousness of a Type I error depends on the situation, 
Bradley (1978) suggests that as a general guide, when testing at the .05 level, the actual 
level should be between .025 and .075. Based on this criterion, the only concern is that for 
a very heavy-tailed distribution, the estimated level drops below .025, the lowest estimate 
being .020. Increasing both sample sizes to 50 corrects this problem. For example, with 
g = h = .2 and 7 = . 2 , the estimate for situation SI increases from .020 to .034. 

Notice that the lowest estimates in Table 2 occur for 7 = .2 when g = h = .2. Simulations 
were run again with rii = n 2 = 100 as well as ni = n 2 = 200 as a partial check on the impact 
of using larger sample sizes. The estimated Type I error probabilities for these two situations 
were .036 and .040, respectively. 
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As previously explained, there are at least three reasons to expect that the global test 
will have more power than method W. The extent this is true depends on the situation. 
To provide at least some perspective, consider the case where the covariate has a normal 
distribution and the error term has a g-and-h distribution. First consider g = h = Q 
(normality), and suppose the hrst group has /di = /do = 0 , while for the second group 
Y = .5 + e. With ni = n 2 = 50, and testing at the .05 level, the power of method G test 
was estimated to be .51. The probability of rejecting at one or more design points using 
method W was estimated to be .38. If instead Y = .5X + .5 + e for the second group, the 
power estimates are now .75 and .66, respectively. If y = .5X^ + .5 + e, the estimates are 
.89 and .78. For this last situation, if {g,h) = (0,.2), the estimates are .76 and .70. For 
{g, h) = (.2, .2) the estimates are .75 and .70. So all indications are that W has more power, 
with the increase in power estimated to be as high as .12 among the situations considered 
here. 

As already noted, a well-known argument for using a 20% trimmed mean, rather than 
the mean, is that under normality its efficiency compares very well to to the mean, but as we 
move toward a heavy-tailed distribution, the standard error of the mean can be substantially 
larger than the standard error of the 20% trimmed. That is, in terms of power, there is little 
separating the mean and 20% under normality, but for heavier tailed distributions, power 
might be substantially higher using a 20% trimmed mean. For the situation at hand, consider 
again g = h = and Y = .5-|-e, only now method W is applied using means rather than 20% 
trimmed means. Now power is estimated to .43, slightly better than using a 20% trimmed 
for which power was estimated to be .38. Using instead method G, power was estimated to 
be .56. So again, method G offers more power than method W and power is a bit higher 
compared to using a 20% trimmed mean, which was .51. For {g, h) = (0, .2), now the power 
of method W was estimated to .25 when using a mean compared to .48 when using a 20% 
trimmed mean. More relevant to the present paper is that if method G is used with a mean, 
power is estimated to be .30, which is substantially smaller than the estimate of .51 when 
using a 20% trimmed mean. 
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4 Illustrations 


There is the issue of whether method G can reject when method W does not when dealing 
with data from an actual study. There is also the issue of whether comparing quartiles makes 
a practical difference. We report results relevant to these issues using data from the Well 
Elderly 2 study. 

A general goal in the Well Elderly 2 study was to assess the efficacy of an intervention 
strategy aimed at improving the physical and emotional health of older adults. (The data 
are available at http://www.icpsr.umich.edu/icpsrweb/landing.jsp,) A portion of the study 
was aimed at understanding the impact of intervention on depressive symptoms as measured 
by the Center for Epidemiologic Studies Depressive Scale (CES-D). The CES-D (Radloff, 
1977) is sensitive to change in depressive status over time and has been successfully used 
to assess ethnically diverse older people (Lewinsohn et ah, 1988; Foley et ah, 2002). Higher 
scores indicate a higher level of depressive symptoms. Another dependent variable was 
the RAND 36-item Health Survey (SF-36), a measure of self-perceived physical health and 
mental well-being (Hays, 1993; McHorney et ah, 1993). Higher scores reflect greater health 
and well-being. 

Before intervention and six months following intervention, saliva samples were taken at 
four times over the course of a single day: on rising, 30 min after rising, but before taking 
anything by mouth, before lunch, and before dinner. Then samples were assayed for cortisol. 
Extant studies (e.g.. Clow et ah, 2004; Chida & Steptoe, 2009) indicate that measures of 
stress are associated with the cortisol awakening response (CAR), which is defined as the 
change in cortisol concentration that occurs during the first hour after waking from sleep. 
(CAR is taken to be the cortisol level upon awakening minus the level of cortisol after the 
participants were awake for about an hour.) Here, the goal is to compare males and females 
after intervention based on CES-D and SF-36 measures using the CAR as a covariate. 

To illustrate that in practice the global test can reject when method W does not, and 
that comparing lower or upper quantiles can make a practical difference, consider the goal 
of comparing males and females based on CES-D measures using CAR as a covariate. No 
differences are detected based on a 20% trimmed mean or median when using method W as 
well as the global test proposed here. This remains the case when comparing .25 quantiles 


14 



using a bootstrap version of method W. But when using method G to compare the groups 
based on the .25 quantile, a signihcant difference is found . That is, there was no signihcant 
difference between males and females based on a measure of location intended to reflect the 
typical response. But the results indicate that there is a sense in which males tend to have 
even lower CES-D scores than females. 

For the SF-36, testing ([^ based on the median, a signihcant difference is found at the .05 
level. (There were 75 males and 171 females after eliminating missing values.) Figure 1 shows 
a plot of the regression lines where the solid lines is the regression line for males. For the 
males there were 6 outliers among the CAR values and for the females there were 8 outliers 
(based on a boxplot), which were eliminated from the analysis and are not shown in Figure 
1. (Eliminating outliers among the independent variable is allowed. It is eliminating outliers 
among the dependent variable that can cause technical problems.) For the situation in Figure 
1, a bootstrap version of method W indicates signihcant diherences when CAR is negative 
(cortisol increases shortly after awakening). In practical terms, the results indicated that 
the typical males perceived health and well being scores are higher among individuals whose 
cortisol levels increase after awakening. When cortisol decreases, no signihcant diherence 
between males and females is found. Moreover, there appears to be little or no association 
between the CAR and SF-36 among women. For men, again there is no signihcant association 
when cortisol increases. But when cortisol decreases, a negative association is found. (The 
slope dihers signihcantly from zero, p = .03, when htting a straight line regression via a 
generalization of the Theil-Sen estimator that is designed to handle tied values.) 

Note that in Figure 1, there appears to be curvature for the males. A test of the hypothesis 
that the regression line is straight was performed using the R function qrchk in Wilcox 
(2012b, p. 544). If again the six outliers among the independent variable are eliminated, 
the hypothesis of a straight line is rejected at the .05 level {p = .046). If the outliers are 
retained, now p = .005. So the results suggest that as CAR increases, there is little change 
in the typical SF-36 value when CAR is negative. But for CAR positive, the typical SF-36 
value for males decreases. 

To add perspective. Figure 2 shows the least squares regression lines for the same data 
used in Figure 1. If the classic ANCOVA method is applied, the slopes do not differ sig¬ 
nihcantly at the .05 level {p = .16) and intercepts do differ signihcantly {p = .008). But 
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Figure 1: Regression lines for predicting perceived health and well-being. The independent 
variable is the cortisol awakening response. The solid line is the .5 quantile regression line 
for males. 
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Figure 2: The least squares regression lines for predicting perceived health and well-being 
using the same data shown in Figure 1. Again, the solid line is the regression line for males. 

Figure 1 suggests that there is a distinct bend approximately where CAR is equal to —.1. 
Indeed, the least squares estimates of slope for males, based on the CAR values greater — .1, 
differs significantly from the slope for females using a method that allows heteroscedasticity, 
p = .011. (Heteroscedasticity was addressed be estimating the standard errors via the HC4 
estimator. See for example Wilcox, 2012a, p. 242. Again CAR values flagged as outliers 
by a boxplot were removed.) Using instead the Theil-Sen estimator, again the slopes are 
signihcantly different, p = .047. 
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5 Concluding Remarks 


In summary, all indications are that method G avoids Type I errors well above the nominal 
level. The highest estimated Type I error probability was .056 when testing at the .05 level. 
The only known concern is that when dealing with a very heavy-tailed distribution, the Type 
I error probability might be less than .025 with sample sizes of 30. Increasing the sample 
sizes to 50, this problem was avoided among the situations considered. 

It is unclear under what circumstances some asymptotic result might be used to determine 
an appropriate critical value. The answer depends on the sample sizes, the span used by 
the running interval smoother {^l and £ 2 ) and the number of covariate values used. But this 
would seem to be a minor inconvenience in most situations because a critical value can be 
determined fairly quickly using the method described in the paper. Even with sample sizes 
of 300, execution time was only 39.5 seconds on a MacBook Pro. 

It is not being suggested that method G dominates all approaches relevant to ANGOVA. 
It seems fairly evident that no single method dominates, one reason being that different 
methods are sensitive to different features of the data. Rather, method G provides an 
approach to ANGOVA that might have practical value in various situations, as was illustrated 
using the Well Elderly data. Here, for example, by dealing with curvature in a flexible 
manner, coupled with a robust measure of location, the results indicated that when GAR is 
negative, typical SF-36 scores for males tend to be higher than scores for females. The extent 
they differ appears to have little to do with the value of GAR. But for GAR greater than zero, 
this is no longer the case. The differences between males and females tend to decrease as 
GAR increases. Both classic ANGOVA and robust methods indicate that males tend to have 
higher SF-36 scores. But the robust methods provide a more detailed picture regarding when 
this is this case. Method G is just one tool that helps provide a more detailed understanding 
of data beyond the non-robust and less flexible approach based on classic ANGOVA methods. 
Put in broader terms, is there a single number or a single method that tells us everything 
we would like to know about how groups compare? We would suggest that the answer is no. 
Method G is aimed at dealing with this issue. 

Finally, R software is available for applying method G. The function ancGLOB per¬ 
forms the calculations and is stored on the first author’s web page. For faster execution 
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time, C++ subroutines have been written that compute the critical p-value. To take ad¬ 
vantage of these subroutines, first install the R package devtools with the R command 
install.packages(“devtools”). Then the C++ subroutines can be installed with the following 
commands: 

library("devtools") 

install_github( "WRScpp", "mrxiaohe") 

Finally, when using the R function ancGLOB, set the argument cpp=TRUE. 
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